We want to understand how happiness is distributed around the world. We examine the relationships between happiness and economic and health indicators to determine which factors are significantly related to happiness. We also explore how the factors that determine happiness are similar or different across countries and across regions. Measuring subjective well-being around the world requires a scientific basis.
We use the World Happiness Report 2018 from the United Nations, a landmark survey of the state of global happiness. Based on pooled results from Gallup World Poll surveys, the Report includes its usual ranking of the levels and changes in happiness around the world. The complete online data set accompanies Chapter 2 of the Report. The first sheet contains 17 variables covering the most recent decade. Here we use only the 2017 data, which cover 141 countries (df1).
dplyr::glimpse(df_2017)
## Observations: 141
## Variables: 19
## $ country <fctr> Afgh...
## $ year <int> 2017,...
## $ Life.Ladder <dbl> 2.661...
## $ Log.GDP.per.capita <dbl> 7.460...
## $ Social.support <dbl> 0.490...
## $ Healthy.life.expectancy.at.birth <dbl> 52.33...
## $ Freedom.to.make.life.choices <dbl> 0.427...
## $ Generosity <dbl> -0.10...
## $ Perceptions.of.corruption <dbl> 0.954...
## $ Positive.affect <dbl> 0.496...
## $ Negative.affect <dbl> 0.371...
## $ Confidence.in.national.government <dbl> 0.261...
## $ Democratic.Quality <dbl> NA, N...
## $ Delivery.Quality <dbl> NA, N...
## $ Standard.deviation.of.ladder.by.country.year <dbl> 1.454...
## $ Standard.deviation.Mean.of.ladder.by.country.year <dbl> 0.546...
## $ GINI.index..World.Bank.estimate. <dbl> NA, N...
## $ GINI.index..World.Bank.estimate...average.2000.15 <dbl> NA, 0...
## $ gini.of.household.income.reported.in.Gallup..by.wp5.year <dbl> 0.286...
In the World Happiness Report, the following 10 variables are included in the analyses (Technical Box 1 and Appendix Table A1):
Life Ladder: an individual's self-anchoring score, that is, the step of the ladder on which people say they personally stand at this time. On the ladder, 0 represents the worst possible life and 10 represents the best possible life.
Log GDP per capita: the natural logarithm of GDP per capita, measured in terms of Purchasing Power Parity (PPP) adjusted to constant 2011 international dollars, taken from the World Development Indicators (WDI) released by the World Bank in September 2017.
Social support: the share of people reporting that they have friends or relatives whom they can count on for help in case of need.
Healthy life expectancy at birth: constructed based on data from the World Health Organization (WHO) and WDI.
Freedom to make life choices: the percentage of people who report being satisfied when asked, “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
Generosity: the residual of regressing the national average of GWP responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.
Perceptions of corruption: the percentage of people answering “yes” to the questions of whether corruption is widespread throughout the government in their country and whether corruption is widespread within businesses.
Positive affect: defined as the average of previous-day affect measures for happiness, laughter, and enjoyment; for waves where the happiness question was not asked, it is the average of laughter and enjoyment.
Negative affect: defined as the average of previous-day affect measures for worry, sadness, and anger for all waves.
Confidence in national government: the percentage of people answering “yes” to the question whether they have confidence in the national government of the country.
The online dataset also includes Democratic Quality, Delivery Quality, Standard deviation of ladder by country-year, Standard deviation/Mean of ladder by country-year, GINI index (World Bank estimate), GINI index (World Bank estimate) average 2000-15, and gini of household income reported in Gallup, by wp5-year. The GINI index is a measure of statistical dispersion intended to represent the income or wealth distribution of a country’s residents, and so serves as a measure of inequality.
The Report also includes a dataset with a region indicator and, for each country, the average and standard deviation of Life ladder, Log of GDP per person, GDP per person, Healthy life expectancy, Social support, Freedom to make life choices, Generosity (without adjustment for GDP per person), and Perceptions of corruption. The region indicator takes the values Sub-Saharan Africa, East Asia, North America and ANZ, Western Europe, South Asia, Southeast Asia, Central and Eastern Europe, Middle East and North Africa, Commonwealth of Independent States, and Latin America and Caribbean. We import this dataset as df2.
We also import the Happiness score as a separate data frame (df3).
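First, we look at the patterns of missing data, plotting them with aggr and counting the missing values in each column with the mymiss function.
library(VIM) # provides aggr()
aggr(df_2017) # plot the missing-data pattern
mymiss <- function(x) sum(is.na(x)) # number of NAs in a vector
sapply(df_2017, mymiss)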
## country
## 0
## year
## 0
## Life.Ladder
## 0
## Log.GDP.per.capita
## 7
## Social.support
## 1
## Healthy.life.expectancy.at.birth
## 0
## Freedom.to.make.life.choices
## 1
## Generosity
## 8
## Perceptions.of.corruption
## 12
## Positive.affect
## 1
## Negative.affect
## 1
## Confidence.in.national.government
## 13
## Democratic.Quality
## 141
## Delivery.Quality
## 141
## Standard.deviation.of.ladder.by.country.year
## 0
## Standard.deviation.Mean.of.ladder.by.country.year
## 0
## GINI.index..World.Bank.estimate.
## 141
## GINI.index..World.Bank.estimate...average.2000.15
## 16
## gini.of.household.income.reported.in.Gallup..by.wp5.year
## 0
Because GINI index (World Bank estimate) has a high number of missing values, and imputing them might lead to wrong conclusions, we do not consider this index. Since we first analyze the factors of happiness based on the survey questions and economic indices, we do not include “GINI index (World Bank estimate), average 2000-15” (Column 13) and “gini of household income reported in Gallup, by wp5-year” (Column 14). We then omit the “Happiness score”, since we carry out further analyses on the factors contributing to it.
df_2017_nscale <- df_2017_nscale[, -13:-14]
df_2017_nscale <- df_2017_nscale[, -15]
df_2017_nscale <- df_2017_nscale[, -13:-14]
df_2017 <- df_2017[, -13:-14]
df_2017 <- df_2017[, -15]
df_2017 <- df_2017[, -13:-14]
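The positional drops above depend on the exact column order at each step. As a minimal sketch of an alternative, assuming a toy data frame with made-up values (the names toy, na_counts, all_missing and toy_kept are ours, not part of the original analysis), columns can be selected by name instead:

```r
# Toy stand-in for df_2017; the values are made up for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  Life.Ladder = c(5.1, 6.3, 4.2),
  Democratic.Quality = c(NA, NA, NA),
  Delivery.Quality = c(NA, NA, NA),
  Generosity = c(0.10, NA, -0.05)
)

# Count missing values per column (same idea as mymiss).
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Drop the columns that are missing for every row, selecting them by name.
all_missing <- names(na_counts)[na_counts == nrow(toy)]
toy_kept <- toy[, !(names(toy) %in% all_missing), drop = FALSE]

print(na_counts)
print(names(toy_kept))
```

Selecting by name keeps the result the same even if columns are later added or reordered.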
We name the data frame “df_2017_nscale” to make clear that it is only scaled and standardized later in the analysis. The data must be scaled before further analyses. Since each country has its own economic status and its own responses to the survey questions, we do not impute missing data. Instead, we omit the countries (observations) with missing data. This leaves 113 countries (observations) with complete data.
df_2017$Healthy.life.expectancy.at.birth <- scale(df_2017$Healthy.life.expectancy.at.birth)
df_2017$Life.Ladder <- scale(df_2017$Life.Ladder)
df_2017$Log.GDP.per.capita <- scale(df_2017$Log.GDP.per.capita)
df_2017_omit1 <- na.omit(df_2017)
df_2017_omit <- na.omit(df_2017)
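The two steps above, standardizing a column and then dropping incomplete rows, can be sketched on a small example. This is only an illustration on hypothetical values; toy and toy_complete are names we introduce here.

```r
# Toy stand-in for df_2017; the values are hypothetical.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  Log.GDP.per.capita = c(7.5, 9.0, NA, 10.5),
  Social.support = c(0.5, 0.8, 0.7, 0.9)
)

# scale() returns a one-column matrix; as.numeric() keeps a plain vector.
# Missing values are ignored when the mean and standard deviation are computed.
toy$Log.GDP.per.capita <- as.numeric(scale(toy$Log.GDP.per.capita))

# Keep only rows with no missing values (listwise deletion).
toy_complete <- na.omit(toy)

print(toy_complete)
```

Note that scaling before na.omit uses every non-missing value of a column, including values from rows that are dropped afterwards because another column is missing.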
We then use the data set that includes the region indicator and average the countries within each region to create a region-level data frame.
df_region <- merge(df, df2)
df_region <- df_region[, 1:20]
df_region <- as.data.frame(subset(df_region, df_region$year == 2017))
df_region <- df_region[, -17]
df_region <- df_region[, -13:-14]
df_region <- df_region[, -10:-11]
df_region <- df_region[, -14]
df_region <- df_region[, -11:-12]
df_region <- na.omit(df_region)
df_region <- aggregate(df_region[,3:11], list(df_region$Region.indicator), mean)
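The averaging step can be sketched with the formula interface of aggregate, which avoids positional column indices. The data below are hypothetical and the names toy and region_means are ours.

```r
# Hypothetical country-level data with a region label.
toy <- data.frame(
  Region.indicator = c("North", "North", "South", "South"),
  Life.Ladder = c(7.0, 6.0, 4.0, 5.0),
  Social.support = c(0.9, 0.8, 0.6, 0.7)
)

# Formula interface: average every other column within each region.
region_means <- aggregate(. ~ Region.indicator, data = toy, FUN = mean)

print(region_means)
```

By default the formula interface drops rows with missing values before aggregating, which plays the same role as the na.omit call above.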
We want to find out how the contributors to the happiness index are related to each other. A correlation matrix based on the 2017 data is a good way to visualize these relations. The larger the circle, the stronger the correlation between the corresponding variables. Blue indicates a positive correlation and red a negative one.
df_2017_nscale2 <- merge(df_2017_nscale, df3)
df_2017_nscale2 <- df_2017_nscale2[,1:15]
df_2017_nscale2<- na.omit(df_2017_nscale)
corr_data <- df_2017_nscale2 %>%
group_by(Life.Ladder,Log.GDP.per.capita,
Healthy.life.expectancy.at.birth,Social.support,Freedom.to.make.life.choices,
Generosity, Perceptions.of.corruption, Confidence.in.national.government,
GINI.index..World.Bank.estimate...average.2000.15)
corr_data <- corr_data[, -14]
corr_data <- corr_data[, -10:-11]
colnames(corr_data)[11] <- "GINI.Index"
colnames(corr_data)[10] <- "Confidence.in.Gov"
colnames(corr_data)[6] <- "Life.Expectancy"
colnames(corr_data)[7] <- "Freedom"
colnames(corr_data)[9] <- "Corruption"
colnames(corr_data)[4] <- "GDP Per Capita"
corrplot(cor(corr_data[,3:11]), tl.cex = 0.8)
From the correlation matrix, we can see some interesting correlations:
The largest blue circles are in the top left. Life Ladder, GDP Per Capita, Social support and Life Expectancy are highly correlated with each other. Freedom is also positively correlated with those four variables, but more weakly.
The red circles are mainly at the bottom left (mirrored in the top right). Corruption, Confidence in Government and GINI index are negatively correlated with each other. Notice that Life Expectancy is also negatively correlated with those three variables.
Freedom is negatively correlated with Corruption; this is the only negative correlation between Freedom and the other variables.
Generosity is not significantly correlated with most of the variables (Life Ladder, GDP Per Capita, Social support). It is negatively correlated with Corruption and positively correlated with Confidence in Government. Generosity is also weakly positively correlated with the GINI index.
Corruption is negatively correlated with most other variables, the GINI index being the exception: higher Corruption goes with a higher GINI index, that is, higher income inequality.
The correlation between the GINI index and Confidence in Government is very weak.
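Reading correlations off circle sizes can be complemented by a ranked list of the coefficients themselves. The following is a minimal sketch on hypothetical values for five countries; toy, cm and pair_list are names we introduce, and only base R is used.

```r
# Hypothetical indicator values for five countries.
toy <- data.frame(
  Life.Ladder = c(4, 5, 6, 7, 8),
  GDP = c(7.0, 8.1, 8.9, 10.2, 11.0),
  Corruption = c(0.9, 0.8, 0.85, 0.5, 0.3)
)

# Pearson correlations; "pairwise.complete.obs" tolerates missing values.
cm <- cor(toy, use = "pairwise.complete.obs")

# List each pair of variables once, strongest absolute correlation first.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_list <- data.frame(
  var1 = rownames(cm)[idx[, 1]],
  var2 = colnames(cm)[idx[, 2]],
  r = cm[idx]
)
pair_list <- pair_list[order(-abs(pair_list$r)), ]

print(pair_list)
```

The same cm matrix could be passed to corrplot to draw the circles.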
It is difficult to visualize all the variables one by one for every country. Principal component analysis reduces the redundancy among the variables and hence the dimensionality of the visualization. A small number of principal components replaces the large number of original variables. Each principal component is a linear combination of the original variables, so it retains the characteristics of each observation. First, we use the scree plot to find the optimal number of principal components. The red line marks the cut-off at a variance of 1, and we keep 3 principal components.
rownames(df_2017_omit) <- df_2017_omit[,1]
df_2017_omit[,1] <- NULL
fit <- prcomp(x = df_2017_omit[,-1], center = TRUE, scale = TRUE)
summary(fit)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2505 1.5602 1.2211 0.8285 0.78805 0.62515 0.5950
## Proportion of Variance 0.4221 0.2028 0.1243 0.0572 0.05175 0.03257 0.0295
## Cumulative Proportion 0.4221 0.6249 0.7492 0.8064 0.85813 0.89070 0.9202
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.51439 0.50525 0.45935 0.34241 0.33079
## Proportion of Variance 0.02205 0.02127 0.01758 0.00977 0.00912
## Cumulative Proportion 0.94226 0.96353 0.98111 0.99088 1.00000
screeplot(fit, npcs = 12, type = "lines")
abline(h = 1, col = "red")
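The cut-off at 1 can also be checked numerically. A minimal sketch, using the standard deviations printed by summary(fit) above (the names sdev, eigenvalues, n_keep and prop_var are ours):

```r
# Standard deviations copied from the summary(fit) output.
sdev <- c(2.2505, 1.5602, 1.2211, 0.8285, 0.78805, 0.62515,
          0.5950, 0.51439, 0.50525, 0.45935, 0.34241, 0.33079)

# The variance of each component is its squared standard deviation.
eigenvalues <- sdev^2

# Keep the components whose variance exceeds 1 (the red line).
n_keep <- sum(eigenvalues > 1)

# Share of total variance per component, and the running total.
prop_var <- eigenvalues / sum(eigenvalues)

print(n_keep)
print(round(cumsum(prop_var)[1:n_keep], 4))
```

Three components have a variance above 1, and together they account for about 75% of the total variance, in line with the Cumulative Proportion row.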
Using three principal components, we approximate the happiness score with a linear model. PC1 and PC3 are highly significant, so we visualize these two components. We start with a bi-plot of PC1 and PC3, in which each number represents a country (the row number in the original df1).
newdf <- data.frame(country = rownames(df_2017_omit), fit$x[,1:3]) # country names were moved to the row names above
#rownames(df3) <- df3$country
#df3$country <- NULL
#newdf <- merge(newdf, df3)
#newdf <- newdf[,1:5]
#fit.lm <- lm(Happiness.score ~ .,newdf[,-1])
#summary(fit.lm)
#rownames(newdf) <- newdf$country
newdf[,1] <- NULL
biplot(fit, choices = c(1, 3)) # plot PC1 against PC3; the default is PC1 and PC2
However, the bi-plot is not very clear because of the large number of observations and variables. We can still see correlations among the variables similar to those found in the correlation matrix. The smaller the angle between two variable arrows, the more positive their correlation. Once the angle exceeds 90 degrees, the larger the angle, the stronger the negative correlation.
We therefore create a scatter plot of PC1 against PC3 only, in which each dot represents a country. Notice that the dots are concentrated in some areas, which means that the distances between those countries are small and they are similar. It is therefore interesting to conduct a cluster analysis.
plot(fit$x[,1],fit$x[,3],xlab="PC1", ylab="PC3", main="Principal Components",pch=20)
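The linear model on the principal component scores mentioned earlier is not run in the code above. As a rough sketch of how such a regression might look, we use the built-in mtcars data as a stand-in, with mpg playing the role of the happiness score and the remaining columns playing the role of the indicators. This substitution, and the names predictors, pca_demo, scores and fit_lm_demo, are assumptions made purely for illustration.

```r
# mtcars ships with base R; it only stands in for the happiness data.
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
pca_demo <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Regress the outcome on the first three principal component scores.
scores <- data.frame(mpg = mtcars$mpg, pca_demo$x[, 1:3])
fit_lm_demo <- lm(mpg ~ PC1 + PC2 + PC3, data = scores)

# The coefficient table shows which components are significant.
print(coef(summary(fit_lm_demo)))
```

Because the component scores are uncorrelated with each other, each coefficient can be read without worrying about collinearity among the predictors.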
As mentioned above, we would like to find out which countries are similar in terms of the factors of happiness. Similar countries fall into the same cluster. As with principal component analysis, we need to find the optimal number of clusters (under the k-means clustering method). In the graph there is a “turning point” at 3, after which the curve flattens.
fviz_nbclust(df_2017_omit[, -1], FUN = kmeans, method = "wss")
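The quantity behind this graph is the total within-cluster sum of squares for each candidate number of clusters. A minimal base R sketch of that computation, using the built-in USArrests data as a stand-in for the happiness data (an assumption for illustration; x and wss are our names):

```r
set.seed(1)  # k-means starts from random centers; fix the seed to reproduce
x <- scale(USArrests)  # built-in stand-in for the scaled happiness data

# Total within-cluster sum of squares for k = 1, ..., 6.
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# plot(1:6, wss, type = "b") would draw the elbow curve.
print(round(wss, 1))
```

The "turning point" is the value of k after which adding another cluster reduces wss only slightly.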
Thus, we conduct the subsequent analysis with 3 clusters. We can see the three clusters and their corresponding profiles. Clearer plots are shown in the next section.
fit.km <- kmeans(df_2017_omit[, -1], 3, nstart = 25)
fviz_cluster(fit.km, df_2017_omit[, -1])
means <- as.data.frame(fit.km$centers)
means$cluster <- row.names(means)
df_long <- gather(means, key = "variable", value = "value",
Life.Ladder:gini.of.household.income.reported.in.Gallup..by.wp5.year)
ggplot(data = df_long, aes(x = variable, y = value, group = cluster, color = cluster, shape = cluster)) +
geom_point(size = 3) +
geom_line(size = 1) +
labs(title = "Profiles for Happiness Clusters") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
fit.km$cluster
## Albania Argentina Armenia
## 3 3 3
## Australia Austria Azerbaijan
## 1 1 3
## Bangladesh Belarus Belgium
## 2 3 1
## Benin Bolivia Bosnia and Herzegovina
## 2 3 3
## Botswana Brazil Bulgaria
## 2 3 3
## Burkina Faso Cameroon Chad
## 2 2 2
## Chile Colombia Congo (Brazzaville)
## 1 3 2
## Congo (Kinshasa) Costa Rica Croatia
## 2 1 3
## Cyprus Czech Republic Denmark
## 1 1 1
## Dominican Republic Ecuador El Salvador
## 3 3 3
## Estonia Ethiopia Finland
## 3 2 1
## France Gabon Georgia
## 1 3 3
## Germany Ghana Greece
## 1 2 3
## Guatemala Guinea Haiti
## 3 2 2
## Honduras Hungary Iceland
## 3 3 1
## India Indonesia Iraq
## 2 3 3
## Ireland Israel Italy
## 1 1 1
## Ivory Coast Jamaica Japan
## 2 3 1
## Kazakhstan Kenya Kyrgyzstan
## 3 2 3
## Laos Latvia Lebanon
## 2 3 3
## Liberia Luxembourg Macedonia
## 2 1 3
## Madagascar Malawi Mali
## 2 2 2
## Mauritania Mauritius Mexico
## 2 3 1
## Moldova Mongolia Montenegro
## 3 3 3
## Mozambique Myanmar Namibia
## 2 2 2
## Nepal Netherlands Nicaragua
## 2 1 3
## Niger Nigeria Norway
## 2 2 1
## Pakistan Panama Peru
## 3 1 3
## Philippines Portugal Romania
## 3 1 3
## Russia Senegal Serbia
## 3 2 3
## Sierra Leone Slovakia Slovenia
## 2 1 1
## South Africa South Korea Spain
## 2 1 1
## Sri Lanka Sweden Switzerland
## 3 1 1
## Tajikistan Tanzania Thailand
## 3 2 3
## Togo Tunisia Turkey
## 2 3 3
## Uganda Ukraine United Kingdom
## 2 3 1
## United States Uruguay Uzbekistan
## 1 1 3
## Zambia Zimbabwe
## 2 2
Now we know which cluster each country belongs to. From the Profiles for Happiness Clusters, we see that the main differentiation occurs in healthy life expectancy at birth, life ladder, and log GDP per capita. Healthy life expectancy at birth creates the largest gap between Cluster 1 (orange line, e.g. the United States, Austria, Iceland and Japan) and Cluster 2 (green line, e.g. Tanzania, Laos and Ethiopia). Cluster 3 (blue line) lies between the other two clusters.
Moreover, Cluster 1 (orange line) countries have relatively high freedom to make life choices, low perceptions of corruption, high positive affect and high social support. Cluster 2 (green line) countries have relatively high confidence in national government, a high GINI index and relatively low social support. Cluster 3 (blue line) has the lowest generosity and high perceptions of corruption.
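The long printout of fit.km$cluster is easier to read when grouped by cluster. A small sketch, using a handful of assignments copied from the output above (the vector cluster and the list members are our names):

```r
# A few assignments copied from the fit.km$cluster output.
cluster <- c(Australia = 1, Austria = 1, Bangladesh = 2, Benin = 2,
             Albania = 3, Colombia = 3, Chile = 1, Tanzania = 2)

# Group the country names by cluster label.
members <- split(names(cluster), cluster)
print(members)

# Number of countries in each cluster.
print(table(cluster))
```

Applied to the full fit.km$cluster vector, the same two calls list every country by cluster. Keep in mind that k-means cluster numbers are arbitrary labels and may be permuted from one run to the next.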
Since the number of clusters can vary, we adopt another clustering method that better shows the clusters for different numbers of clusters or different levels. Hierarchical clustering is based on the Euclidean distance between countries (observations) over all variables. The results of hierarchical clustering are presented in a dendrogram.
d <- dist(df_2017_omit[, -1], method = "euclidean")
fit.hc <- hclust(d, method = "complete")
plot(fit.hc, hang = -1, cex = 0.8)
rect.hclust(fit.hc, k = 3, border = "red")
Note that some countries that are close to each other on the dendrogram are also close geographically. For example, the left-hand side consists mainly of African countries, Japan and South Korea are next to each other, and the European countries are in the middle. This suggests further analysis on more general regions.
Similarly, we run cluster analysis on the regions to get a more general set of clusters. Since there are only 10 regions (observations), we run the hierarchical clustering first and get the dendrogram. We then use k-means clustering to find the profiles of the different clusters.
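The rectangles drawn by rect.hclust correspond to cutting the tree into a fixed number of groups, which cutree returns as a vector. A minimal sketch comparing such a cut with a k-means partition, again with the built-in USArrests data standing in for the country-level data (an assumption for illustration; hc_demo, hc_groups and km_groups are our names):

```r
set.seed(1)
x <- scale(USArrests)  # built-in stand-in for the country-level data

# Complete-linkage hierarchical clustering on Euclidean distances.
hc_demo <- hclust(dist(x, method = "euclidean"), method = "complete")
hc_groups <- cutree(hc_demo, k = 3)  # cut the dendrogram into 3 groups

km_groups <- kmeans(x, centers = 3, nstart = 25)$cluster

# Cross-tabulate the two partitions. Labels are arbitrary, so look for
# rows and columns that share most of their members.
print(table(hierarchical = hc_groups, kmeans = km_groups))
```

If the two methods agree, each row of the table has most of its counts in a single column.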
rownames(df_region) <- df_region$Group.1
df_region$Group.1 <- NULL
df_region_scale1 <- data.frame(scale(df_region))
d_region <- dist(df_region_scale1, method = "euclidean")
fit.hc_region <- hclust(d_region, method = "complete")
plot(fit.hc_region, hang = -1, cex = 0.8)
rect.hclust(fit.hc_region, k = 4, border = "red")
From the dendrogram, we see that Sub-Saharan Africa is not close to any other region. North America and ANZ (Australia and New Zealand) is close to Western Europe, although those two regions are far from each other geographically. The next cluster includes South Asia and Southeast Asia, which are geographically close. Another cluster includes East Asia, Central and Eastern Europe, Middle East and North Africa, Commonwealth of Independent States, and Latin America and Caribbean.
We then use k-means clustering with 4 clusters to find the profiles of the different clusters. (Readers can also run the k-means clustering with other numbers of clusters.)
# by region
df_region_scale <- data.frame(scale(df_region))
fit.km_region <- kmeans(df_region_scale, 4, nstart = 25)
fviz_cluster(fit.km_region, df_region_scale)
means_region <- as.data.frame(fit.km_region$centers)
means_region$cluster <- row.names(means_region)
df_long_region <- gather(means_region, key = "variable", value = "value",
Life.Ladder:GINI.index..World.Bank.estimate...average.2000.15)
ggplot(data = df_long_region, aes(x = variable, y = value, group = cluster, color = cluster, shape = cluster)) +
geom_point(size = 3) +
geom_line(size = 1) +
labs(title = "Profiles for Happiness Clusters by Region") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
fit.km_region$cluster
## Central and Eastern Europe Commonwealth of Independent States
## 3 3
## East Asia Latin America and Caribbean
## 3 2
## Middle East and North Africa North America and ANZ
## 3 1
## South Asia Southeast Asia
## 4 4
## Sub-Saharan Africa Western Europe
## 4 1
The profiles of clusters by region show larger differences than the profiles of clusters by country, but the results are similar.
Cluster 4 (Southeast Asia, South Asia, Sub-Saharan Africa) has high confidence in national government, low healthy life expectancy at birth, low life ladder, low log GDP per capita and low social support, which is similar to Cluster 2 in the analysis by country. This cluster also has relatively high generosity, lower only than Cluster 1 (North America and ANZ, Western Europe).
Cluster 3 (Middle East and North Africa, Central and Eastern Europe, East Asia, Commonwealth of Independent States) has low freedom to make life choices and a high level of perceptions of corruption. Its other variables lie between the extremes, which is similar to Cluster 3 in the previous analysis by country.
Cluster 2 includes only Latin America and Caribbean, which makes this region unique among the regions. It has low generosity and a high GINI index. Its other variables are mostly close to Cluster 3.
Cluster 1 (North America and ANZ, Western Europe) has high freedom, high generosity, high healthy life expectancy, high life ladder, high log GDP, low perceptions of corruption and high social support. These characteristics are close to those of Cluster 1 in our previous analysis by country.
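The profiles above are expressed in standardized units, since k-means was run on scaled data. To read a cluster center in the original units, the scaling can be undone. A minimal sketch, with ten rows of the built-in USArrests data standing in for the ten regions (an assumption for illustration; raw, km and centers_raw are our names):

```r
set.seed(1)
raw <- USArrests[1:10, ]  # ten rows stand in for the ten regions
x <- scale(raw)
km <- kmeans(x, centers = 3, nstart = 25)

# Undo the standardization: multiply by the column standard deviations,
# then add back the column means that scale() stored as attributes.
centers_raw <- sweep(km$centers, 2, attr(x, "scaled:scale"), "*")
centers_raw <- sweep(centers_raw, 2, attr(x, "scaled:center"), "+")

print(round(centers_raw, 1))
```

Each row of centers_raw is the plain average of the original variables over the members of that cluster.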
From the correlation matrix and the principal component analysis, we find strong correlations among Life Ladder, GDP Per Capita, Social support and Life Expectancy. It is surprising but also reasonable that individuals’ health, support from family and friends, and current life status are highly related to the country’s GDP per capita. If a national government wants to boost its residents’ happiness, a better economy is essential. Another important way for a government to raise its happiness ranking is to address corruption, which has a negative influence on the happiness of its citizens.
From the cluster analyses by country and region, we see the similarity of countries that are geographically close to each other, especially in Asia and Africa. This may be due to climate, trade, immigration, religion, or even culture and history. When one country in a region starts to develop faster, its geographical neighbors may follow. Neighboring countries should cooperate with each other to pursue stronger economic development as well as better well-being, for example through improved medical technology and healthcare systems.
Further research can examine how these variables change over the years and how those changes relate to changes in happiness levels, which could give policy-makers more insight.
Helliwell, J., Layard, R., & Sachs, J. (2018). World Happiness Report 2018. New York: Sustainable Development Solutions Network.